Data Frames, Plotting and Dates

1  Adding Color to Plots

Color is often referred to as the third dimension of a 2-dimensional plot, because it allows us to add extra information to an ordinary scatterplot. Consider the graph of literacy and income. By examining boxplots, we can see that there are differences among the distributions of income (and literacy) for the different continents, and it would be nice to display some of that information on a scatterplot. This is one situation where factors come in very handy. Since factors are stored internally as numbers (starting at 1 and going up to the number of unique levels of the factor), it's very easy to assign different observations different colors based on the value of a factor variable.
To illustrate, let's replot the income vs. literacy graph, but this time we'll convert the continent into a factor and use it to choose the color of the point plotted for each country. In the world1 data frame, the continent is stored in the column (variable) called cont. We convert this variable to a factor with the factor function, but first let's look at its mode and class before the conversion:
> mode(world1$cont)
[1] "character"
> class(world1$cont)
[1] "character"
> world1$cont = factor(world1$cont)

In many situations, the cont variable will behave the same as it did when it was a simple character variable, but notice that its mode and class have changed:
> mode(world1$cont)
[1] "numeric"
> class(world1$cont)
[1] "factor"

Having made cont into a factor, we need to choose some colors to represent the different continents. There are a few ways to tell R what colors you want to use. The easiest is to just use a color's name. Most colors you think of will work, but you can run the colors function without an argument to see the official list. You can also use the method that's commonly used by web designers, where colors are specified as a pound sign (#) followed by three pairs of hexadecimal digits providing the levels of red, green and blue, respectively. Using this scheme, red is represented as '#FF0000', green as '#00FF00', and blue as '#0000FF'. To see how many unique values of cont there are, we can use the levels function, since it's a factor. (For non-factors, the unique function is available, but it may give the values in an unexpected order.)
> levels(world1$cont)
[1] "AF" "AS" "EU" "NA" "OC" "SA"

There are six levels. The first step is to create a vector of color values:
mycolors = c('red','yellow','blue','green','orange','violet')

To make the best possible graph, you should probably be more careful when choosing the colors, but this will serve as a simple example.
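
The two ways of naming colors described above can be checked against each other. The following is a minimal sketch using the base R functions col2rgb and rgb; the variable names are only illustrative.

```r
# Named colors: colors() returns the full list of names R recognizes.
"violet" %in% colors()

# col2rgb() gives the red, green and blue levels (0-255) of a color.
red_levels <- col2rgb("red")
red_levels

# rgb() goes the other way and builds the '#RRGGBB' string.
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex

# Hex strings can be mixed freely with names in a color vector.
mixed <- c("red", "#00FF00", "blue")
```
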
Now, when we make the scatterplot, we add an additional argument, col=, which is a vector with one element for each point that we're plotting - the color in each position is the color that will be used to draw the corresponding point on the graph. Probably the easiest way to build that vector is to use the value of the factor cont as a subscript to the mycolors vector that we created earlier. (If you don't see why this does what we want, please take a look at the result of mycolors[world1$cont].)
with(world1,plot(literacy,income,col=mycolors[cont]))
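
To see why subscripting by a factor works, it may help to try it on a small example. The sketch below uses a made-up data frame called toy in place of world1 (the values and the names toy and toy_colors are assumptions for illustration only).

```r
# Toy data standing in for world1 (values are made up for illustration).
toy <- data.frame(
  literacy = c(60, 95, 99, 70),
  income   = c(1200, 8000, 30000, 2500),
  cont     = factor(c("AF", "AS", "EU", "AF"))
)
toy_colors <- c("red", "yellow", "blue")   # one color per level: AF, AS, EU

# A factor's internal codes are integers from 1 to the number of levels.
codes <- as.integer(toy$cont)
codes

# Subscripting by the factor uses those codes, giving one color per row.
point_colors <- toy_colors[toy$cont]
point_colors
```

Both rows with continent AF receive the first color, because AF is the first level of the factor.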

There's one more detail that we need to take care of. Since we're using color on the graph, we have to provide some way for someone viewing the graph to tell which color represents which continent, i.e. we need to add a legend to the graph. In R, this is done with the legend command. There are many options to this command, but in its simplest form we just tell R where to put the legend, whether we should show points or lines, and what colors they should be. A title for the legend can also be added, which is a good idea in this example, because the meaning of the continent abbreviations may not be immediately apparent. You can specify x- and y-coordinates for the legend location or you can use one of several shortcuts like "topleft" to do things automatically. (You may also want to look at the locator command, which lets you decide where to place your legends interactively.) For our example, the following will place a legend in an appropriate place; the title command is also used to add a title to the plot:
with(world1,legend('topleft',legend=levels(cont),col=mycolors,pch=1,title='Continent'))
title('Income versus Literacy for Countries around the World')

Here's what the plot looks like:

2  Taking More Control Over Graphics

Although consulting the help file for a particular plotting function will often yield useful information, the R graphics system relies on a general method for setting a variety of graphical parameters through the par function. You should definitely familiarize yourself with the capabilities of this function before trying to customize any graphics. Two settings that you will probably want to use are xlim= and ylim=, which are passed as arguments to plot rather than set through par. Each accepts a vector of length two, giving the minimum and maximum values that will be displayed on the x- and y-axes, respectively. For example, suppose we are investigating the relationship between income and military spending in the world1 data frame:
> plot(world1$income,world1$military)

The problem is that the large outlier for military spending makes it very difficult to see the relationships among the other points. To resolve this problem, we can replot the graph, using the ylim= argument to restrict the y-axis to the range from 0 to 1e+11:
plot(world1$income,world1$military,ylim=c(0,1e11))

Many other graphics parameters exist to control things like the size and spacing of axis labels, the number of tick marks on the axes, the size of your plot and many other details.
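
As a rough sketch of how par is typically used, the example below changes a few parameters, draws a plot, and then restores the earlier settings. The data are made up, the particular parameter values are arbitrary, and pdf(NULL) is used only so that the example can run without writing a file.

```r
pdf(NULL)                       # draw to a null device so no file is written

# par() returns the previous values of whatever it changes.
old <- par(cex.axis = 0.8,      # smaller axis annotation
           mar = c(5, 4, 2, 1), # margins: bottom, left, top, right (in lines)
           lab = c(10, 5, 7))   # roughly 10 x-axis and 5 y-axis tick marks

x <- c(1, 2, 3, 4, 50)          # made-up data with one outlying point
y <- c(2, 4, 6, 8, 1000)
plot(x, y, xlim = c(0, 5), ylim = c(0, 10))

par(old)                        # restore the earlier settings
restored <- par("cex.axis")
invisible(dev.off())
```

Saving the value returned by par and passing it back afterwards is a common way to keep one plot's customizations from leaking into the next.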

3  Using Dates in R

Dates on computers have been the source of much anxiety, especially at the turn of the century, when people felt that many computers wouldn't understand the new millennium. These fears were based on the fact that certain programs would store the value of the year in just 2 digits, causing great confusion when the century "turned over". In R, dates are stored as they have traditionally been stored on Unix computers - as the number of days from a reference date, in this case January 1, 1970, with earlier days being represented by negative numbers. When dates are stored this way, they can be manipulated like any other numeric variable (as far as it makes sense). In particular, you can compare or sort dates, take the difference between two dates, or add an increment of days, weeks, months or years to a date. The class of such dates is Date and their mode is numeric. Dates are created with as.Date, and formatted for printing with format (which will recognize dates and do the right thing).
Because dates can be written in so many different formats, R uses a standard way of providing flexibility when reading or displaying dates. A set of format codes, some of which are shown in the table below, is used to describe what the input or output form of the date looks like. The default format for as.Date is a four digit year, followed by a month, then a day, separated by either dashes or slashes. So conversions like this happen automatically:
> as.Date('1915-6-16')
[1] "1915-06-16"
> as.Date('1890/2/17')
[1] "1890-02-17"
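
The date arithmetic described at the start of this section can be sketched as follows. The dates are arbitrary, and the use of seq to step by a month is just one possible approach, not the only one.

```r
d1 <- as.Date('2010-01-15')
d2 <- as.Date('2010-02-01')

as.numeric(as.Date('1970-01-01'))   # days since the reference date
as.numeric(as.Date('1969-12-31'))   # earlier dates are negative

d1 < d2                             # comparison
gap <- d2 - d1                      # difference, a difftime in days
next_week <- d1 + 7                 # adding days
# Months and years vary in length, so seq() is one way to step by them.
next_month <- seq(d1, by = 'month', length.out = 2)[2]
```
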

The formatting codes are as follows:
Code    Value
%d      Day of the month (decimal number)
%m      Month (decimal number)
%b      Month (abbreviated)
%B      Month (full name)
%y      Year (2 digit)
%Y      Year (4 digit)
(For a complete list of the format codes, see the R help page for the strptime function.)
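
As a small illustration of the codes in the table, the same codes can be used both for printing and for reading dates. The date here is arbitrary, and note that the month names produced by %b and %B depend on the locale R is running in.

```r
d <- as.Date('1997-12-19')

# Output: the codes describe how the date should be printed.
short <- format(d, '%d/%m/%y')
short
format(d, '%B %d, %Y')        # month name depends on the current locale

# Input: the same codes describe what the text being read looks like.
parsed <- as.Date('19/12/1997', format = '%d/%m/%Y')
parsed
```
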
As an example of reading dates, the URL http://www.stat.berkeley.edu/classes/s133/data/movies.txt contains the names, release dates, and box office earnings for around 1000 of the most popular movies of all time. The first few lines of the input file look like this:
Rank|name|box|date
1|Titanic|$600.788|December 19, 1997
2|Avatar|$594.472|December 18, 2009
3|The Dark Knight|$529.143|July 18, 2008

As can be seen, the fields are separated by vertical bars, so we can use read.delim with the appropriate sep= argument.
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> head(movies)
  Rank                               name      box              date
1    1                            Titanic $600.788 December 19, 1997
2    2                             Avatar $594.472 December 18, 2009
3    3                    The Dark Knight $529.143     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $436.471      May 19, 2004
6    6         E.T. the Extra-Terrestrial $433.005     June 11, 1982

The first step in using a data frame is making sure that we know what we're dealing with. A good way to start is to use the sapply function to look at the mode of each of the variables:
> sapply(movies,mode)
       Rank        name         box        date
  "numeric" "character" "character" "character"

Unfortunately, the box office receipts (box) are character, not numeric. That's because R doesn't recognize a dollar sign ($) as being part of a number. (R has the same problem with commas.) We can remove the dollar sign with the sub function, and then use as.numeric to make the result into a number:
> movies$box = as.numeric(sub('\\$','',movies$box))
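
Since sub replaces only the first match in each string, values that also contain commas need gsub instead. The following sketch uses made-up amounts (not taken from the movies file) to show one way of stripping both characters at once.

```r
# Hypothetical values, not taken from the movies file.
amounts <- c('$1,234.50', '$600.788', '$2,000,000')

# sub() replaces only the first match; gsub() replaces every match.
# Inside square brackets the dollar sign needs no escaping.
cleaned <- gsub('[$,]', '', amounts)
values <- as.numeric(cleaned)
values
```
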

To convert the character date values to R Date objects, we can use as.Date with the appropriate format: in this case it's the month name followed by the day of the month, a comma and the four digit year. Consulting the table of format codes, this translates to '%B %d, %Y':
> movies$date = as.Date(movies$date,'%B %d, %Y')
> head(movies$date)
[1] "1997-12-19" "2009-12-18" "2008-07-18" "1977-05-25" "2004-05-19"
[6] "1982-06-11"

The format that R now uses to print the dates is the standard Date format, letting us know that we've done the conversion correctly. (If we wanted to recover the original format, we could use the format function with a format similar to the one we used to read the data.)
Another way to create dates is with the ISOdate function. This function accepts three numbers representing the year, month and day of the date that is desired. So to reproduce the last date in the previous vector, we could use
> lastdate = ISOdate(1982,6,11)
> lastdate
[1] "1982-06-11 12:00:00 GMT"

Notice that, along with the date, a time is printed. That's because ISOdate returns an object of class POSIXct (which inherits from POSIXt), not Date. To make a date like this work properly with objects of class Date, you can use the as.Date function.
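
A minimal sketch of that conversion, using an arbitrary date and illustrative variable names:

```r
stamp <- ISOdate(2010, 2, 1)   # an arbitrary date chosen for illustration
class(stamp)                   # a date-time class, not Date

as_date <- as.Date(stamp)      # drops the time of day
class(as_date)

# Once converted, it can be compared with other Date values.
same_day <- as_date == as.Date('2010-02-01')
same_day
```
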
Once we've created an R Date value, we can use the functions months, weekdays or quarters to extract those parts of the date. For example, to see which day of the week these very popular movies were released, we could use the table function combined with weekdays:
 
> table(weekdays(movies$date))
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
      738        13         9         9        42        24       165 

Notice that the ordering of the days is not what we'd normally expect: table sorts the day names alphabetically. This problem can be solved by creating a factor that has the levels in the correct order:
> movies$weekday = factor(weekdays(movies$date),
+    levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'),ordered=TRUE)

Now we can use weekday to get a nicer table:
> table(movies$weekday)
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
       13        24       165        42       738         9         9 
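
The weekdays function returns day names in the language of the current locale, so the English level names used above assume an English locale. One possible locale-independent variant, sketched here on the six release dates printed earlier, uses the %u format code (weekday as a number, with Monday as 1), which is described on the strptime help page. The variable names are illustrative.

```r
# The six release dates printed by head(movies$date) earlier.
d <- as.Date(c('1997-12-19', '2009-12-18', '2008-07-18',
               '1977-05-25', '2004-05-19', '1982-06-11'))

# %u gives the weekday as a number from 1 (Monday) to 7 (Sunday).
day_number <- as.integer(format(d, '%u'))

day_names <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
               'Friday', 'Saturday', 'Sunday')
weekday <- factor(day_number, levels = 1:7, labels = day_names,
                  ordered = TRUE)
counts <- table(weekday)
counts
```

Days with no releases still appear in the table with a count of zero, because every level of the factor is tabulated.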

Similarly, if we wanted to display a bar chart showing which month of the year the popular movies were released in, we could first create an ordered factor, then use the barplot function:
> movies$month = factor(months(movies$date),
+    levels=c('January','February','March','April','May','June',
+             'July','August','September','October','November','December'),
+    ordered=TRUE)
> barplot(table(movies$month))
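
R also provides the built-in constant month.name, which holds the twelve English month names in calendar order and can save some typing. The sketch below, on a few of the dates seen earlier, builds the factor from the numeric month so that it does not depend on the locale; the variable names are illustrative.

```r
d <- as.Date(c('1997-12-19', '2008-07-18', '1977-05-25', '2004-05-19'))

# month.name is a built-in constant: "January", ..., "December".
month_number <- as.integer(format(d, '%m'))
month <- factor(month.name[month_number], levels = month.name,
                ordered = TRUE)
month_counts <- table(month)
month_counts[c('May', 'July', 'December')]
```
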

To do a similar thing with years, we'd have to create a new variable that represented the year using the format function. For a four digit year the format code is %Y, so we could make a table of the hit movies by year like this:
> table(format(movies$date,'%Y'))
1938 1939 1940 1942 1946 1950 1953 1955 1956 1959 1961 1963 1964 1965 1967 1968 
   1    1    1    1    1    1    1    1    1    1    1    1    2    3    3    2 
1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 
   1    5    2    3    2    7    3    5    4    6   11   10    8   11   14   11 
1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 
  11   12   15   15   24   20   23   28   24   20   35   28   34   41   41   46 
2001 2002 2003 2004 2005 2006 2007 2008 2009 
  47   49   56   58   50   61   46   51   40 
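
Because format returns character strings, the years need to be converted to numbers before doing any arithmetic with them. As a sketch, here is one way to group a handful of the release dates seen earlier by decade (the variable names are illustrative):

```r
d <- as.Date(c('1977-05-25', '1982-06-11', '1997-12-19',
               '2004-05-19', '2008-07-18', '2009-12-18'))

# format() returns character strings, so convert before doing arithmetic.
year <- as.numeric(format(d, '%Y'))

# Integer division groups the years into decades.
decade <- 10 * (year %/% 10)
decade_counts <- table(decade)
decade_counts
```
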



